{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"align-center\">\n",
    "<a href=\"https://oumi.ai/\"><img src=\"https://oumi.ai/docs/en/latest/_static/logo/header_logo.png\" height=\"200\"></a>\n",
    "\n",
    "[![Documentation](https://img.shields.io/badge/Documentation-latest-blue.svg)](https://oumi.ai/docs/en/latest/index.html)\n",
    "[![Discord](https://img.shields.io/discord/1286348126797430814?label=Discord)](https://discord.gg/oumi)\n",
    "[![GitHub Repo stars](https://img.shields.io/github/stars/oumi-ai/oumi)](https://github.com/oumi-ai/oumi)\n",
    "</div>\n",
    "\n",
    "👋 Welcome to Open Universal Machine Intelligence (Oumi)!\n",
    "\n",
    "🚀 Oumi is a fully open-source platform that streamlines the entire lifecycle of foundation models - from [data preparation](https://oumi.ai/docs/en/latest/resources/datasets/datasets.html) and [training](https://oumi.ai/docs/en/latest/user_guides/train/train.html) to [evaluation](https://oumi.ai/docs/en/latest/user_guides/evaluate/evaluate.html) and [deployment](https://oumi.ai/docs/en/latest/user_guides/launch/launch.html). Whether you're developing on a laptop, launching large scale experiments on a cluster, or deploying models in production, Oumi provides the tools and workflows you need.\n",
    "\n",
    "🤝 Make sure to join our [Discord community](https://discord.gg/oumi) to get help, share your experiences, and contribute to the project! If you are interested in joining one of the community's open-science efforts, check out our [open collaboration](https://oumi.ai/community) page.\n",
    "\n",
    "⭐ If you like Oumi and you would like to support it, please give it a star on [GitHub](https://github.com/oumi-ai/oumi)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# vLLM Inference Engine\n",
    "\n",
    "This notebook demonstrates how to use the `VLLMInferenceEngine` class for inference with Llama 3.3 70B."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prerequisites\n",
    "\n",
    "## Machine Requirements\n",
    "\n",
    "❗**NOTICE:** This notebook doesn't run on Colab because the GPU is too old to be supported by vLLM.\n",
    "\n",
    "It is recommended to run this notebook on a machine with GPU support, as vLLM is mainly intended to run on GPUs. Llama 3.3 70B requires 140GB VRAM to serve, though we also provide examples below for inference with Llama 3.1 8B, Llama 3.2 1B, and quantized Llama 3.3 70B that require less memory.\n",
    "\n",
    "If your local machine cannot run this notebook, you can instead run this notebook on a cloud platform. The following demonstrates how to open a VSCode instance backed by a GCP node with 4 A100 GPUs, from which the notebook can be run.\n",
    "\n",
    "```bash\n",
    "# Run on your local machine\n",
    "gcloud auth application-default login  # Authenticate with GCP\n",
    "make gcpcode ARGS=\"--resources.accelerators A100:4\"  # 4 A100-40GB GPUs, enough for 70B model. Can also use 2x \"A100-80GB\"\n",
    "```\n",
    "\n",
    "## Oumi Installation\n",
    "\n",
    "First, let's install Oumi and vLLM. You can find more detailed instructions about Oumi installation [here](https://oumi.ai/docs/en/latest/get_started/installation.html). Here, we include Oumi's GPU dependencies.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install oumi[gpu]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Llama Access\n",
    "\n",
    "Llama 3.3 70B is a gated model on HuggingFace Hub. To run this notebook, you must first complete the [agreement](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) on HuggingFace, and wait for it to be accepted. Then, specify `HF_TOKEN` below to enable access to the model if it's not already set.\n",
    "\n",
    "Usually, you can get the token by running this command `cat ~/.cache/huggingface/token` on your local machine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "if not os.environ.get(\"HF_TOKEN\"):\n",
    "    # NOTE: Set your Hugging Face token here if not already set.\n",
    "    os.environ[\"HF_TOKEN\"] = \"<MY_HF_TOKEN>\"\n",
    "hf_token = os.environ.get(\"HF_TOKEN\")\n",
    "print(f\"Using HF Token: '{hf_token}'\")\n",
    "\n",
    "# This is needed for vLLM to use multiple GPUs in a notebook.\n",
    "# If you're not running in a notebook, you can ignore this.\n",
    "os.environ[\"VLLM_WORKER_MULTIPROC_METHOD\"] = \"spawn\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To download Llama 3.3 70B to your machine before inference, run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install hf_transfer\n",
    "os.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"1\"\n",
    "! hf download meta-llama/Llama-3.3-70B-Instruct --exclude original/*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "from oumi.core.configs import InferenceConfig\n",
    "from oumi.core.types import Conversation, Message, Role\n",
    "from oumi.inference import VLLMInferenceEngine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# If we have multiple GPUs, we can use Ray to parallelize the inference.\n",
    "# This is essential if you're running a model that's too big to fit in a single GPU.\n",
    "\n",
    "import ray\n",
    "\n",
    "if torch.cuda.is_available() and torch.cuda.device_count() >= 2:\n",
    "    ray.shutdown()\n",
    "    ray.init(num_gpus=torch.cuda.device_count())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting up the config file\n",
    "\n",
    "Note: in this section we are writing the config file to the current working directory.\n",
    "\n",
    "An alternative option is to initialize the params classes directly: `ModelParams`, `GenerationParams`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "config_path = \"vllm_tutorial_llama70b_infer.yaml\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile vllm_tutorial_llama70b_infer.yaml\n",
    "\n",
    "model:\n",
    "  # model_name: \"meta-llama/Llama-3.1-8B-Instruct\"  # 8B model, requires 1x A100-40GB GPUs\n",
    "  model_name: \"meta-llama/Llama-3.3-70B-Instruct\"  # 70B model, requires 4x A100-40GB GPUs\n",
    "  model_max_length: 512\n",
    "  torch_dtype_str: \"bfloat16\"\n",
    "  trust_remote_code: True\n",
    "  attn_implementation: \"sdpa\"\n",
    "\n",
    "generation:\n",
    "  max_new_tokens: 128\n",
    "  batch_size: 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the model and the inference engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "# Download, and load the model in memory\n",
    "# This may take a while, depending on your internet speed.\n",
    "# The inference engine only needs to be loaded once and can be\n",
    "# reused for multiple conversations.\n",
    "\n",
    "config = InferenceConfig.from_yaml(config_path)\n",
    "\n",
    "inference_engine = VLLMInferenceEngine(\n",
    "    config.model,\n",
    "    tensor_parallel_size=torch.cuda.device_count(),  # use all available GPUs\n",
    "    # Adjustments to help Llama-3.3-70B-Instruct run on 4 A100-40GB GPUs\n",
    "    enable_prefix_caching=False,\n",
    "    gpu_memory_utilization=0.95,\n",
    "    max_num_seqs=10,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing our inputs\n",
    "\n",
    "The inference engine expects a list of conversations, where each conversation is a list of messages.\n",
    "\n",
    "See the [Conversation](https://github.com/oumi-ai/oumi/blob/38b3d2b27407be5fc9be5a1dd88f9ad518f3491c/src/oumi/core/types/turn.py#L109) class for more details.\n",
    "\n",
    "Tip: you can visualize how the conversation is rendered as a prompt with the following:\n",
    "\n",
    "```python\n",
    "inference_engine.apply_chat_template(conversation, tokenize=False)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "conversations = [\n",
    "    Conversation(\n",
    "        messages=[\n",
    "            Message(\n",
    "                role=Role.SYSTEM, content=\"Translate the following text into French.\"\n",
    "            ),\n",
    "            Message(role=Role.USER, content=\"Hello, how are you?\"),\n",
    "        ]\n",
    "    ),\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Running inference\n",
    "\n",
    "Under the hood, the vLLM engine will batch the conversations to run inference with a high throughput.\n",
    "\n",
    "Make sure to feed all your prompts to the engine at once for maximum throughput."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "\n",
    "print(f\"Running inference for {len(conversations)} conversations\")\n",
    "\n",
    "generations = inference_engine.infer(\n",
    "    input=conversations,\n",
    "    inference_config=config,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "for conversation in generations:\n",
    "    print(repr(conversation))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bonus: Running quantized GGUF models\n",
    "\n",
    "You can also run quantized GGUF models, by downloading the model file and passing it to the engine.\n",
    "\n",
    "For example, to run the Llama 3.3 70B model quantized at 4-bit, you can do the following: \n",
    "\n",
    "First, we download the GGUF model file. There are multiple quantization schemes available, here we choose the `Q4_K_S` scheme which is 4-bit with the `K_S` quantization algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "\n",
    "repo_id = \"bartowski/Llama-3.3-70B-Instruct-GGUF\"\n",
    "filename = \"Llama-3.3-70B-Instruct-Q4_K_S.gguf\"\n",
    "\n",
    "# Will download the model in the current working directory instead of HF_CACHE_DIR\n",
    "model_path = hf_hub_download(repo_id, filename=filename, local_dir=\".\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We then update the config file to point to the model we just downloaded:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%writefile vllm_tutorial_llama70b_infer.yaml\n",
    "\n",
    "model:\n",
    "  # Filepath to the GGUF model, which we just downloaded, see `model_path` output above\n",
    "  model_name: \"Meta-Llama-3.1-70B-Instruct-Q4_K_S.gguf\"  \n",
    "  # GGUF files do not have a config. We need to specify the tokenizer name manually.\n",
    "  tokenizer_name: \"meta-llama/Llama-3.3-70B-Instruct\"  \n",
    "  model_max_length: 512\n",
    "  torch_dtype_str: \"float16\"  # GGUF models require float16\n",
    "  trust_remote_code: True\n",
    "  attn_implementation: \"sdpa\"\n",
    "\n",
    "generation:\n",
    "  max_new_tokens: 128\n",
    "  batch_size: 1"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "oumi",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
